Abstract: For the additive Gaussian noise channel with average
codeword power constraint, sparse superposition codes and
adaptive successive decoding are developed. Codewords are linear
combinations of subsets of vectors, with the message indexed by 
the choice of subset. A feasible decoding algorithm is presented. 
Communication is reliable with error probability exponentially 
small for all rates below the Shannon capacity. 

I. Introduction 

Sparse superposition codes with computationally feasible
decoding are shown to achieve exponentially small error probability
for any rate below the capacity. A companion presentation
[5] gives bounds for optimal least squares decoding.

Code construction is by linear combination of vectors of
length $n$. Let $X_1, X_2, \ldots, X_N$ be a dictionary of such vectors.
Organize it in a matrix $X$ of $N = BL$ columns, partitioned
into $L$ sections of size $B$, a power of 2. Codewords are
superpositions $X\beta = \sum_j \beta_j X_j$ with each section having one
non-zero term. The set of such $\beta$ is not closed under linear
combination, so these are not linear codes in the algebraic
coding sense. Nevertheless, they are fast to code and decode.

The message is conveyed by the choice of the subset of
$L$ terms, with one from each section. From an input bit
string $u = (u_1, u_2, \ldots, u_K)$, with $K = L \log_2 B$, encoding
is realized by regarding $u$ as a concatenation of $L$ numbers,
each with $\log_2 B$ bits, specifying the selected columns. The
codewords $c = X\beta$ have power $(1/n)\sum_{i=1}^{n} c_i^2$, which will be
near $P$ when averaged across the $2^K$ possible codewords. The
received vector is $Y = X\beta + \varepsilon$ with $\varepsilon$ distributed $N(0, \sigma^2 I)$.
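The encoding just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `encode`, the equal power allocation $P/L$, the toy sizes, and the use of NumPy are all choices made here for the example.

```python
import numpy as np

def encode(bits, L, B, P, n, rng):
    """Map K = L*log2(B) bits to a sparse superposition codeword (equal power).

    Hypothetical helper for illustration; names are not from the paper.
    """
    log_b = B.bit_length() - 1
    assert 2 ** log_b == B, "B must be a power of 2"
    assert len(bits) == L * log_b
    # Each block of log2(B) bits is read as an integer in [0, B):
    # the column selected within its section.
    chunks = np.asarray(bits).reshape(L, log_b)
    within = chunks.dot(1 << np.arange(log_b - 1, -1, -1))
    sent = np.arange(L) * B + within           # one global column index per section
    X = rng.standard_normal((n, L * B))        # dictionary with i.i.d. N(0, 1) entries
    beta = np.zeros(L * B)
    beta[sent] = np.sqrt(P / L)                # equal power P/L in every section
    return X, beta, sent

rng = np.random.default_rng(0)
L, B, P, sigma2, n = 8, 16, 15.0, 1.0, 64      # toy sizes, chosen arbitrarily
bits = rng.integers(0, 2, size=L * 4)
X, beta, sent = encode(bits, L, B, P, n, rng)
codeword = X @ beta
Y = codeword + np.sqrt(sigma2) * rng.standard_normal(n)   # received vector
```

The coefficient vector has exactly one non-zero entry per section, so the sum of squares of `beta` equals $P$ by construction, while the realized codeword power $(1/n)\sum_i c_i^2$ only matches $P$ on average.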

A decoder maps the received vector into an estimate $\hat u$.
With $sent = (j_1, j_2, \ldots, j_L)$ being the terms sent, the decoder
produces estimates $\hat j_1, \hat j_2, \ldots, \hat j_L$. Overall block error is the
event $\hat u \ne u$ and section error is the event $\hat j_\ell \ne j_\ell$. The fraction
of section mistakes is $(1/L)\sum_{\ell=1}^{L} 1_{\{\hat j_\ell \ne j_\ell\}}$.

The reliability requirement is that the mistake rate is small 
with high probability or the block error probability is small, 
averaged over input strings u as well as the distribution of Y. 
The supremum of reliable communication rates $R = K/n$ is
the channel capacity $C = (1/2)\log_2(1 + P/\sigma^2)$, as in El, [10].
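As a quick numerical check of the capacity formula, the sketch below evaluates $C$ for the two signal-to-noise ratios used in the figures later in the paper. The function names are illustrative only.

```python
import math

def capacity_bits(snr):
    """Capacity C = (1/2) log2(1 + snr) in bits per channel use, snr = P / sigma^2."""
    return 0.5 * math.log2(1.0 + snr)

def capacity_nats(snr):
    """Same capacity expressed in nats, convenient when powers use e^{-2C...}."""
    return 0.5 * math.log(1.0 + snr)

for snr in (1.0, 15.0):
    print(f"snr = {snr:4.1f}: C = {capacity_bits(snr):.3f} bits, "
          f"{capacity_nats(snr):.3f} nats")
```

At snr = 15 the capacity is 2 bits per channel use, and at snr = 1 it is 0.5 bits.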

The challenge is to achieve arbitrary rates below the capac- 
ity, with reliable decoding in manageable computation time. 
Here communication rates are identified which are moderately 
close to the capacity and a fast decoding scheme is devised. It 
is demonstrated to have probability that is exponentially small 
in L/{logB)^ of there being more than a moderately small 
fraction of section mistakes. 



The setting adopted is the discrete-time channel with real-valued
inputs and outputs and independent Gaussian noise.
Standard communication models have been reduced to this
setting as in [16], [14], when there is a frequency band constraint
with specified noise spectrum. Solution to the coding
problem, married to appropriate modulation, is relevant to
myriad settings involving transmission over wires or cables
for internet, television, or telephone or in wireless radio,
TV, phone, satellite or other space communications. Previous
standard approaches, as discussed in [14], entail a decomposition
into separate problems of modulation, of shaping of a
multivariate signal constellation, and of coding. Though there
are practical schemes with empirically good performance,
theory for practical schemes achieving capacity is lacking. In
our analysis, shaping is built directly into the code design.

The entries of $X$ are generated with the independent standard
normal distribution. The coefficients $\beta_j$ are equal to
$\sqrt{P_{(\ell)}}$ for $j = j_\ell$ in sent and equal to 0 otherwise, with sum
of squares $\sum_{\ell=1}^{L} P_{(\ell)} = P$ matching the power constraint. In
the simplest case, the same power is allocated to each section,
$P_{(\ell)} = P/L$. We also consider the choice of variable power
with $P_{(\ell)}$ proportional to $e^{-2C\ell/L}$ and a slight variant of this
allocation in which the power is variable across most $\ell$ and
then levels for $\ell/L$ near 1.
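The constant and exponentially decaying allocations can be compared with a short sketch. It assumes that $C$ in the exponent is the capacity measured in nats, so that $1 - e^{-2C} = P/(P + \sigma^2)$; the function name and parameter values are illustrative.

```python
import numpy as np

def power_allocation(L, P, sigma2, variable=True):
    """Per-section powers P_(l), l = 1..L, normalized to sum to P.

    variable=True : P_(l) proportional to exp(-2 C l / L), with C in nats.
    variable=False: constant allocation P / L.
    """
    C = 0.5 * np.log(1.0 + P / sigma2)
    ell = np.arange(1, L + 1)
    weights = np.exp(-2.0 * C * ell / L) if variable else np.ones(L)
    return P * weights / weights.sum()

alloc_var = power_allocation(L=100, P=15.0, sigma2=1.0)
alloc_const = power_allocation(L=100, P=15.0, sigma2=1.0, variable=False)
print(alloc_var[:3], alloc_var[-3:])
```

Under the variable rule the early sections receive the most power, which is what allows them to be decoded first.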

For a rate $R$ code, $nR = L \log B$, so the codelength $n$ and
the subset size $L$ agree to within a log factor. Setting $L = B$ is
sensible, or, for a target codelength $n$, one may set $B = n$ and
$L = nR/\log n$. For the best case developed here, the rate $R$ is
chosen to have a drop from capacity that is near $1/\log B$, to
within a loglog factor. When the signal to noise ratio is large,
one finds it desirable to arrange $\log B$ to be at least as large
as $C$ to achieve at least a constant fraction of capacity.

Let's summarize our findings. With constant power allocation,
a two-step algorithm and a multi-step improvement
reliably achieve rates up to a rate $R_0 = (1/2)P/(P + \sigma^2)$
less than capacity. With variable power and order $\log B$ steps,
we bring the achievable rate up near capacity $C$, albeit with
a gap from capacity of order $1/\sqrt{\log B}$. With the variant in
which the power is leveled for $\ell/L$ near 1, the gap from
capacity is reduced to order $1/\log B$, to within a loglog factor,
and, moreover, the section mistake rate is less than a constant 
times 1 / log B, except in an event of probability exponentially 
small in L/{\ogB)^, as we report here. Subsequent to the 
submission of this conference paper, we have refined this
probability bound, obtaining that it is exponentially small
in $L/\log B$, or equivalently $n/(\log n)^2$, to within a loglog
factor, as will be reported in the upcoming journal submission.

The performance, as measured by the gap from capacity 
at a similar reliability level, is comparable to benchmarks of 
performance for schemes not demonstrated to be practical, 
including [5] for least squares decoding of related superposition
codes, and [24] for theoretically optimal codes. For a
gap from capacity of order $1/\log n$, the best error probability
is exponentially small in $n/(\log n)^2$.

The decoder initially computes for the received Y, its inner 
product with the terms in the dictionary, and sees which are 
above a threshold. Such a set of inner products and compar- 
isons is performed in parallel by a basic computational unit, 
e.g. a signal-processing chip with parallel accumulators, in 
time of order n. These are pipelined so that the inner products 
are updated in constant time as each element of Y arrives. 

The threshold, set high enough that incorrect terms are 
unlikely to be above threshold, leads to only a small fraction 
of terms decoded in any one such step. Additional steps are 
used to bring the total fraction decoded near 1. These steps 
take the inner products with residuals of the fit from the terms 
previously above threshold. A variant of the inner product with 
residuals is found to be somewhat more amenable to analysis. 

The decoder does not predetermine which sections are to 
be decoded on any one step, rather it adapts the choice in 
accordance with which has inner product observed to be above 
threshold. Thus we call it adaptive successive decoding. 

We determine a function $g(x)$ mapping from $[0, 1]$ into
$[0, 1]$, which has the role that if $x_{k-1}$ is a likely fraction of
sections correctly decoded from previous steps up to $k - 1$ then
$g(x_{k-1})$, slightly adjusted, provides a value $x_k$ of total fraction
of sections likely to be correctly decoded by step $k$. This
function depends on the power allocation rule and the choice
of rate. A choice of communication rate is acceptable if the
function $g(x)$ is greater than $x$ over most of the interval. Such
a function $g$ is said to be accumulative, allowing the succession
of steps to build up a large fraction of correctly decoded
sections, with only a small fraction of mistakes remaining.
The role of $g(x)$ is illustrated in Figure 1.
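The way an accumulative function drives the fraction of decoded sections upward can be illustrated by iterating $x_k = g(x_{k-1})$ from $x_0 = 0$. The sketch below ignores the slight adjustment mentioned above, and `toy_g` is a made-up function used purely for illustration; it is not the $g$ of the paper.

```python
def iterate_to_target(g, x_target, max_steps=1000):
    """Iterate x_k = g(x_{k-1}) from x_0 = 0 until x_k >= x_target.

    Raises ValueError if g fails to exceed x at some point below the target,
    i.e. if g is not accumulative on the interval that is needed.
    """
    x, path = 0.0, [0.0]
    while x < x_target:
        if len(path) > max_steps:
            raise ValueError("too many steps")
        x_next = g(x)
        if x_next <= x:
            raise ValueError(f"g(x) <= x at x = {x}: not accumulative here")
        x = x_next
        path.append(x)
    return path

# Illustrative stand-in only: stays strictly above x on [0, 1).
toy_g = lambda x: min(1.0, x + 0.1 * (1.0 + x))

path = iterate_to_target(toy_g, x_target=0.9)
print(len(path) - 1, "steps:", [round(v, 3) for v in path])
```

The number of steps is governed by how far $g(x)$ sits above $x$: a larger gap means fewer steps to reach the target.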

Our analysis provides summary formulas for the rate and the 
target fraction of mistakes that arise from bounding the extent 
of positivity of $g(x) - x$. These summary formulas provide
proof of a favorable scaling of rate by our scheme for the
particular reliability targets, indexed by the size of the code. 

Moreover, the function $g(x)$ can be evaluated in detail to
choose settings of parameters ($a$, $c$, and $\gamma$ below). This allows
computation of the best communication rate our analysis
achieves, for given error probability and target mistake rates.

The parameter $a$ arises in the threshold $\tau = \sqrt{2\log B} + a$
of the standardized inner products. The parameter $c$ sets
the height at which the variable power is leveled, with power
$P_{(\ell)}$ chosen to be proportional to $\max\{e^{-2C(\ell-1)/L}, cut\}$, with
$cut = e^{-2C}(1 + \delta_c)$ where $\delta_c = c/\sqrt{2\log B}$.

Allowing power proportional to $\max\{e^{-2\gamma(\ell-1)/L}, cut\}$,
with $cut = e^{-2\gamma}(1 + \delta_c)$, for $\gamma$ between 0 and $C$, interpolates
between the constant and variable cases.
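A sketch of the leveled allocation follows, assuming the rule reads $\max\{e^{-2\gamma(\ell-1)/L}, cut\}$ with $cut = e^{-2\gamma}(1 + \delta_c)$ and $\delta_c = c/\sqrt{2\log B}$, with natural logarithms and $C$ in nats. The function name and the example values of $L$ and $B$ are assumptions made for illustration.

```python
import numpy as np

def leveled_allocation(L, B, P, sigma2, c, gamma=None):
    """Powers proportional to max(exp(-2*gamma*(l-1)/L), cut), summing to P.

    gamma=None uses gamma = C (capacity in nats); gamma = 0 gives constant power.
    """
    C = 0.5 * np.log(1.0 + P / sigma2)
    g = C if gamma is None else gamma
    delta_c = c / np.sqrt(2.0 * np.log(B))
    cut = np.exp(-2.0 * g) * (1.0 + delta_c)
    ell = np.arange(1, L + 1)
    weights = np.maximum(np.exp(-2.0 * g * (ell - 1) / L), cut)
    return P * weights / weights.sum()

alloc = leveled_allocation(L=100, B=2**16, P=15.0, sigma2=1.0, c=1.6)
flat = leveled_allocation(L=100, B=2**16, P=15.0, sigma2=1.0, c=1.6, gamma=0.0)
print(alloc[:2], alloc[-2:])
```

With $\gamma = C$ the final sections share one common power level, and with $\gamma = 0$ every section receives $P/L$, which is the interpolation described above.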




Fig. 1. Plots of $g(x)$ and the sequence $x_k$. For snr = 15 the plot takes
$a = 0.86$, $c = 1.6$ and $\gamma = 0.8C$ and the final false alarm and failed
detection rates are 0.026 and 0.013 respectively, with probability bound of at
least that fraction of mistakes equal to 0.002. For snr = 1, constant power
allocation is used with $a = 0.56$ and the false alarm and failed detection
rates are 0.026 and 0.053 respectively, with probability bound 0.0007.



Fig. 2. Curve showing achieved rates as a function of B for snr = 15 and 
snr = 1. The x-axis has B plotted on the log scale. 

Figure 2 plots the rate $R$ as a function of $B$, from optimization
of $a$, $c$, and $\gamma$, maintaining the bound 10^^ on the
probability of a fraction of mistakes exceeding 0.10. Both the
case $L = B$ and a large $L$ limit are shown, as well as some
results of simulation of the algorithm with $L = 100$.

Signed superposition coding, in which the $\ell$-th non-zero
coefficient value is $\pm\sqrt{P_{(\ell)}}$, increases the number of codewords
to $(2B)^L$ with the same reliability bounds, thereby
improving the rate by a factor of $1 + (\log 2)/(\log B)$, above
what is shown in Figure 2. Arbitrary $L$ term subset coding
(without partitioning) is possible, though not as simple, for
a total rate improvement by a $1 + (\log 2e)/(\log B)$ factor.
For this presentation, we focus on the unsigned, partitioned
superposition code case.
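The two rate factors can be checked by counting codewords directly. The sketch below compares the logarithm of the number of codewords for the partitioned, signed, and signed arbitrary-subset constructions; the choice $L = B = 64$ is arbitrary and the function name is illustrative.

```python
import math

def log2_codewords(L, B):
    """log2 of the number of codewords for three constructions."""
    partitioned = L * math.log2(B)                        # B^L: one term per section
    signed = L * math.log2(2 * B)                         # (2B)^L: one signed term per section
    signed_subset = math.log2(math.comb(L * B, L)) + L    # any L of the LB terms, with signs
    return partitioned, signed, signed_subset

L, B = 64, 64
base, signed, subset = log2_codewords(L, B)
signed_factor = signed / base
subset_factor = subset / base
print(signed_factor, 1 + math.log(2) / math.log(B))
print(subset_factor, 1 + math.log(2 * math.e) / math.log(B))
```

The signed factor matches $1 + (\log 2)/(\log B)$ exactly, while the subset factor approaches $1 + (\log 2e)/(\log B)$ only approximately at finite sizes.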

To prevent block errors, our subset superposition codes
combine with error correction codes. The idea is to arrange
sufficient distance between the subsets. Consider composition
with an outer Reed-Solomon (RS) code of rate $1 - 2\delta$ near
one, for an overall rate $(1 - 2\delta)R$. The alphabet of the RS code
is taken to be of size $B$. Interpret its codewords as providing
the sequence of labels $j_1, j_2, \ldots, j_L$ of the terms selected from
the sections. The RS codelength $L$ is taken to be either $B - 1$
or $B$ using a standard extension. RS code properties as in [23]
guarantee correction of any fraction of section mistakes less
than $\delta$. For advocacy of code concatenation see [13]. As a
consequence of our result for the inner code, the composite 
code makes no mistakes, except in an event inheriting the 
exponentially small probability in L/(logi?)^. 

A fascinating alternative approach is channel polarization
ID. which achieves high rates for binary signaling with
feasible decoding, with error probability exponentially small
in $n^{1/2}$. For our scheme the error probability is exponentially
small in $n^{1-\epsilon}$ for any $\epsilon > 0$ and communication is permitted
at higher rates beyond that associated with binary signaling.

Codes empirically demonstrated to be good include low 
density parity check codes and turbo codes, both with iterative 
statistical belief propagation decoding, but mathematical
proof of performance near capacity is so far limited to special
cases such as the binary erasure channel [21], [22].

Another approach to sparse superposition decoding is convex
projection with $\ell_1$ constraint, arising from analogous problems
of statistical learning and signal recovery. Iterative procedures
and properties for such projection are in [18], [3], [20], Pl.
[11], with preliminary findings for communication in [29].
Each iteration would find in each section the term of highest 
inner product with the residuals and use it to update the convex 
combination. It is unclear to us whether convex projection for 
communication can be reliable at rates up to capacity. 

The conclusions may be expressed in the language of 
sparse signal recovery. L terms from a dictionary are linearly 
combined and subject to noise. For signals $X\beta$ recovery of the
terms from the received noisy $Y$ is possible provided the number
of observations $n$ is at least $(1/R)L\log B$. Recovery using
$\ell_1$ constrained convex optimization is accurate provided $R < R_0$
in the equal power case. For our variable power designs,
our results establish recovery by other means at higher R<C. 
These findings complement work in [30], [31], [12], lfnil, [7],
[25], [19]. For typical signal recovery problems there is greater
freedom of design with non-zero coefficients values regarded 
as unknown, leading to bounds based on the minimum non- 
zero signal size, rather than exclusively based on the total 
signal power as in the communication capacity. 

Superposition codes began with [9] for the broadcast channel,
and later for multiple-access channels [8], [27]. Our purpose
of computational feasibility is different from the original
purpose of identifying the set of achievable rates. Another 
connection is the consideration of rate splitting and successive 
decoding. Our adaptive successive decoding yields feasibility 
in the single-user case and should work also in multi-user 
settings. 

II. The Decoder 

From the received $Y$ and knowledge of the dictionary,
decode which terms were sent by an iterative procedure. In
the constant power allocation case set $P_j = P/L$. For the
variable power case let $P_j = P_{(\ell)}$ for $j$ in section $\ell$.

First Step: For each term $X_j$ of the dictionary compute the
statistic $Z_{1,j} = X_j^T Y/\|Y\|$. The terms for which the statistic is
above a threshold $\tau = \sqrt{2\log B} + a$ are regarded as decoded
terms. Denote the associated event $\mathcal{H}_j = \{Z_{1,j} > \tau\}$. The
idea of the first step threshold is that very few of the terms
not sent will be above threshold. Yet a positive fraction of the
terms sent will be above threshold and hence will be correctly
decoded on this first step, with an average likely to be at least
a positive value $q$ as will be quantified.



Let $dec_1 = \{j : 1_{\mathcal{H}_j} = 1\}$ be the set of terms decoded on
this step. The first step provides the fit $F_1 = \sum_{j \in dec_1} \sqrt{P_j}\, X_j$.
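The first step amounts to one matrix-vector product and a comparison with the threshold. The following is a minimal sketch, assuming natural logarithms in the threshold and a strict comparison; the function name `first_step` and the toy dictionary in the usage lines are illustrative only.

```python
import numpy as np

def first_step(X, Y, P_j, B, a):
    """Threshold the normalized inner products Z_{1,j} = X_j^T Y / ||Y||.

    Returns the statistics, the indices above tau = sqrt(2 log B) + a,
    and the fit F_1 built from those terms with coefficients sqrt(P_j).
    """
    tau = np.sqrt(2.0 * np.log(B)) + a
    Z1 = X.T @ Y / np.linalg.norm(Y)
    dec1 = np.flatnonzero(Z1 > tau)
    F1 = X[:, dec1] @ np.sqrt(P_j[dec1])
    return Z1, dec1, F1

# Tiny deterministic illustration (not a random dictionary):
# four orthogonal columns, only the first one present in Y.
X_toy = 5.0 * np.eye(4)
Y_toy = np.array([10.0, 0.0, 0.0, 0.0])
Z1, dec1, F1 = first_step(X_toy, Y_toy, P_j=np.ones(4), B=2, a=0.0)
print(Z1, dec1, F1)
```

In a real run `X` would be the Gaussian dictionary and only a fraction of the sent terms would clear the threshold on this first pass.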

Second Step: For each of the remaining terms, form the inner
product with the vector of residuals $r = Y - F_1$, that is,
compute $X_j^T r$ or its normalized form $X_j^T r/\|r\|$. A
quantity with similar properties is found to be equally easy to
compute and somewhat simpler to analyze. Indeed, compute
$F_Y = [F_1^T Y/\|Y\|^2]\, Y$ which is the part of $F_1$ in the direction
$Y$ and the vector $G = F_1 - F_Y$ which is the part orthogonal to
$Y$. For each of the remaining $j$ compute $Z_{2,j} = X_j^T G/\|G\|$.
Then form the combined test statistic

$$Z^{comb}_{j} = \sqrt{1-\lambda}\, Z_{1,j} - \sqrt{\lambda}\, Z_{2,j}$$

with $\lambda = qP/(\sigma^2 + P)$. The specified $\lambda$ is chosen to maximize
the mean separation between correct and wrong terms. For the
two-step version, complete the decoding, in each section not
previously decoded, by picking the term for which this statistic
is largest, with no need for a second step threshold in that case.

Extension to Multiple Steps: We briefly describe how the
algorithm is extended to multiple steps to provide increased
reliability. The process initializes with $V_{1,j} = X_j$ the vectors
of terms in the dictionary with index set $J_1$ consisting of all the
terms. From the first step, $G_1 = Y$ is the received vector and
the statistics $Z_{1,j}$ are $X_j^T G_1/\|G_1\|$ for $j$ in $J_1$ with associated
events $\mathcal{H}_{1,j} = \mathcal{H}_j$.

For the second step the vector $G_2 = G$ is formed, which
is the part of $F_1$ orthogonal to $G_1 = Y$. The set of terms
investigated on this step is $J_2 = J_1 \cap \{j : 1_{\mathcal{H}_{1,j}} = 0\}$. For
$j$ in $J_2$, the statistic $Z_{2,j} = X_j^T G_2/\|G_2\|$ is computed as
well as the combined statistic $Z^{comb}_{2,j} = \sqrt{\lambda_1}\, Z_{1,j} - \sqrt{\lambda_2}\, Z_{2,j}$,
where $\lambda_1 = 1 - \lambda$ and $\lambda_2 = \lambda$. What is different on the
second step is consideration of the events $\mathcal{H}_{2,j} = \{Z^{comb}_{2,j} > \tau\}$
with the same threshold $\tau$, leading to an additional part
$F_2 = \sum_{j \in J_2} \sqrt{P_j}\, X_j 1_{\mathcal{H}_{2,j}}$ of the fit $F_1 + F_2$. The second
step provides some increase in separation, without attempting
to resolve all in two steps.

Proceed, iteratively, to perform the following loop of calculations,
for $k \ge 2$. From the output of step $k - 1$, there
is available the partial fit vector $F_{k-1}$ and for $k' < k$ the
previously stored vectors $G_{k'}$ and statistics $Z_{k',j}$ for $j$ in the
previous set $J_{k-1}$. Plus there is a set $J_k$ of remaining terms for
us to consider at step $k$. From the residual $r = Y - fit_{k-1}$, one
may compute $Z^{res}_{k,j} = X_j^T r/\|r\|$. Instead, for simplification of
the analysis, compute the part $G_k$ of $F_{k-1}$ orthogonal to the
previous $G_{k'}$ and for each $j$ in $J_k$ compute

$$Z_{k,j} = X_j^T G_k/\|G_k\|$$

and the combined statistic

$$Z^{comb}_{k,j} = \sqrt{1-\lambda_k}\, Z^{comb}_{k-1,j} - \sqrt{\lambda_k}\, Z_{k,j},$$

where the value of $\lambda_k$ we shall specify is again chosen to
maximize a measure of separation between correct and wrong
terms. The statistics $Z^{res}_{k,j}$ are similar, entailing empirically
determined values of $\lambda_k$. The statistics $Z^{comb}_{k,j}$ are compared
to the threshold, leading to the events $\mathcal{H}_{k,j} = \{Z^{comb}_{k,j} > \tau\}$.
The output of step $k$ is the vector

$$F_k = \sum_{j \in J_k} \sqrt{P_j}\, X_j 1_{\mathcal{H}_{k,j}},$$

providing the update $fit_k = fit_{k-1} + F_k$. Also the vector $G_k$ and
the statistics $Z_{k,j}$ are appended to what was previously stored,
at least for the terms $j$ in $J_k$. This step updates the set of
decoded terms $dec_k$ to be $dec_{k-1} \cup \{j \in J_k : 1_{\mathcal{H}_{k,j}} = 1\}$ and
updates the set of terms remaining for further consideration
$J_{k+1} = \{j \in J_k : 1_{\mathcal{H}_{k,j}} = 0\}$. This completes the actions of
step $k$ of the loop. The idea is that on each step $k$ we decode
a substantial part of what remains, because of growth of the
mean separation between terms sent and the others.
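The loop structure can be sketched with the residual-based statistics $X_j^T r/\|r\|$ mentioned above. Note that this is the simpler residual variant, not the combined statistic built from the orthogonal components $G_k$ that the analysis uses; the function name, the stopping rules, and the toy example are assumptions made for illustration.

```python
import numpy as np

def adaptive_successive_decode(X, Y, P_j, B, a, max_steps=20):
    """Iteratively threshold normalized inner products with the residual.

    Each step decodes every remaining term whose statistic exceeds
    tau = sqrt(2 log B) + a, adds it to the fit, and removes it from
    further consideration. Stops when nothing new clears the threshold.
    """
    tau = np.sqrt(2.0 * np.log(B)) + a
    n_terms = X.shape[1]
    remaining = np.ones(n_terms, dtype=bool)   # the set J_k
    decoded = np.zeros(n_terms, dtype=bool)    # the set dec_k
    fit = np.zeros_like(Y, dtype=float)
    for _ in range(max_steps):
        r = Y - fit
        norm = np.linalg.norm(r)
        if norm == 0.0:
            break
        Z = X.T @ r / norm
        hits = remaining & (Z > tau)
        if not hits.any():
            break
        fit += X[:, hits] @ np.sqrt(P_j[hits])
        decoded |= hits
        remaining &= ~hits
    return np.flatnonzero(decoded), fit

# Tiny deterministic illustration with orthogonal columns and no noise.
X_toy = 5.0 * np.eye(4)
beta_toy = np.array([1.0, 0.0, 1.0, 0.0])
Y_toy = X_toy @ beta_toy
dec, fit = adaptive_successive_decode(X_toy, Y_toy, np.ones(4), B=2, a=0.0)
print(dec, fit)
```

The decoder never fixes in advance which sections are handled on which step; the sets of decoded and remaining terms are determined by which statistics happen to clear the threshold.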

III. Reliability 

Let $\hat q_k$, $\hat f_k$ be the fraction of correct detections and false
alarms at step $k$. Also let $\hat f_{1,k} = \hat f_1 + \ldots + \hat f_k$ be the total
fraction of false alarms after $k$ steps. For the variable power
case let $\pi_j = P_j/P$ and use $\hat q_k = \sum_{j \in sent \cap J_k} \pi_j 1_{\mathcal{H}_{k,j}}$ and
$\hat f_k = \sum_{j \in not\,sent \cap J_k} \pi_j 1_{\mathcal{H}_{k,j}}$, weighted fractions, relative
to the total weight of terms sent.

It is not hard to see that $\hat q_{1,k} = \sum_{j\ sent} \pi_j 1_{\mathcal{H}_{k,j}}$ is a lower
bound on $\hat q_1 + \ldots + \hat q_k$, the total weighted fraction of correct
detections from steps 1 to $k$.

Let's specify a target false alarm rate $f^*$ that arises in our
analysis for each step. For step $k$, for given $a > 0$, set

$$f^* = \frac{1}{(\sqrt{2\log B} + a)\sqrt{2\pi}} \exp\{-a\sqrt{2\log B} - (1/2)a^2\}$$

and likewise set values $\bar f > f^*$. Recall that the threshold $\tau =
\sqrt{2\log B} + a$. Indeed, it is unlikely that $\hat f_k$ exceeds $\bar f$.
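One way to read the target $f^*$ is through the standard Gaussian tail bound $\bar\Phi(\tau) \le \phi(\tau)/\tau$: if the $B - 1$ wrong terms of a section had standard normal statistics, the expected number above $\tau = \sqrt{2\log B} + a$ would be at most $B\,\phi(\tau)/\tau$, which simplifies to the expression $\exp\{-a\sqrt{2\log B} - a^2/2\}/((\sqrt{2\log B} + a)\sqrt{2\pi})$. The sketch below checks this numerically; that interpretation, and the function names, are assumptions of this example rather than statements from the text.

```python
import math

def f_star(B, a):
    """Target false alarm rate for threshold tau = sqrt(2 log B) + a."""
    root = math.sqrt(2.0 * math.log(B))
    tau = root + a
    return math.exp(-a * root - 0.5 * a * a) / (tau * math.sqrt(2.0 * math.pi))

def normal_false_alarms(B, a):
    """(B - 1) * P(N(0,1) > tau): expected wrong terms above threshold
    in one section, under an idealized standard normal model."""
    tau = math.sqrt(2.0 * math.log(B)) + a
    return (B - 1) * 0.5 * math.erfc(tau / math.sqrt(2.0))

for B in (2**8, 2**16):
    for a in (0.56, 0.86):
        print(B, a, normal_false_alarms(B, a), f_star(B, a))
```

Raising $a$ lowers the false alarm target at the cost of a higher threshold, which is the trade-off that the optimization over $a$ resolves.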

Similarly, using distributional properties of $\hat q_{1,k}$ using the
function $g(x)$ discussed below, we specify a value $q_{1,k}$ for
which we expect that $\hat q_{1,k}$ is likely to be at least $q_{1,k}$. Further
define $x_0 = 0$ and for $k \ge 1$,

$$x_k = \frac{q_{1,k}}{1 + \bar f_{1,k}/q_{1,k}}$$

where $\bar f_{1,k} = k\bar f$. These $x_k$ are used in setting the weight $\lambda_k$
and in expressing the mean separation $\mu_{k,j}$ between terms sent
and terms not sent. Indeed $\lambda_k = w_k(1 - x_{k-1}\nu)$ with

$$w_k = \frac{1}{1 - x_{k-1}\nu} - \frac{1}{1 - x_{k-2}\nu}$$

measuring the increase in a quantity used in specifying the
separation. For establishing reliability, the critical matter is to
demonstrate that $x_k = q^{adj}_{1,k}$ grows to a value near 1. Define

$$\mu_x(u) = \left(\sqrt{\frac{u}{1 - x\nu}} - 1\right)\sqrt{2\log B} - a'.$$

Here $\nu = P/(\sigma^2 + P) = 1 - e^{-2C}$ and $a' = a + h$, where $h$
is a small positive number.

The $Z^{comb}_{k,j}$ are not normally distributed; nevertheless, it is
demonstrated by induction that in a set of high probability, they
are greater than normal random variables which have mean 0
for terms not sent and mean $\mu_{k,j}$ for terms sent. Across
the terms $j$, the joint normal distribution that arises in this
construction has a covariance $I - \nu_k \beta\beta^T/P$ where $\nu_k \le \nu =
P/(P + \sigma^2)$, for which it is shown that the joint density is not
more than a constant $1/(1 - \nu)^{1/2} = e^{C}$ times the joint density
that would arise if they were independent standard normal.

In the constant power case with $R = R_0$, let $g(x) = \Phi(\mu_x)$
where $\mu_x = \mu_x(1)$. Then for terms sent $\mu_{k,j} = \mu_{x_{k-1}}$ and
$q^*_{1,k} = g(x_{k-1})$ at $x_{k-1} = q^{adj}_{1,k-1}$. If $g(x)$ exceeds $x$, then
there is room to set $q_{1,k}$ just below $q^*_{1,k}$, so that if $\bar f_{1,k} = k\bar f$
is small enough, then $x_k = q^{adj}_{1,k}$ is indeed larger than $x_{k-1}$.

The $g(x) - x$ stays above a positive gap for all $0 \le x \le x^*$.
For the constant power case the positivity holds at $x^*$ provided
$x^*$ is separated from 1 by at least a polynomial in $1/B$, and
this gap at $x^*$ is the minimum value in $[0, x^*]$ provided $a' <
\sqrt{2\pi}(.5 - \bar x^*)$ and $\Phi(-a') > \bar x^*$ where $\bar x^* = 1 - x^*$.

Lemma 1: If $g(x) - x$ is at least a positive gap on an interval
$[0, x^*]$, choose small positive $\eta$ and $\bar f > f^*$. Arrange $\Delta =
gap - \eta$ to be positive and for $4\bar f x^* < \Delta^2$ and arrange $q_{1,k} =
q^*_{1,k} - \eta$ where $q^*_{1,k} = g(q^{adj}_{1,k-1})$. Then the increase on each
step $q_{1,k} - q_{1,k-1}$ for which $q^{adj}_{1,k-1} < x^*$ is at least $\tilde\Delta$, where
$\tilde\Delta$ satisfies $\tilde\Delta = \Delta - x^*\bar f/\tilde\Delta$, quadratic in $\tilde\Delta$ with solution
$\tilde\Delta = \Delta\{1 + (1 - 4x^*\bar f/\Delta^2)^{1/2}\}/2$. Moreover, the number of
steps $m$ required such that on step $m - 1$, the $q^{adj}_{1,m-1}$ first
exceeds $x^*$, is bounded by $m < 1/\tilde\Delta$ steps. At the final step
$q_{1,m}$ exceeds $g(x^*) - \eta$.
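The per-step increase in Lemma 1 is the larger root of a quadratic, which is easy to evaluate. The sketch below assumes the reading in which the increase $d$ solves $d = A - x^*\bar f/d$ with $A = gap - \eta$ and $4\bar f x^* < A^2$; the symbol names, the function name, and the numerical values are illustrative assumptions.

```python
import math

def step_increase(gap, eta, f_bar, x_star):
    """Larger root d of d = A - x_star * f_bar / d, where A = gap - eta.

    Returns (d, 1/d); under the reading assumed here, 1/d bounds the
    number of steps needed to pass x_star.
    """
    A = gap - eta
    if A <= 0.0:
        raise ValueError("gap - eta must be positive")
    disc = 1.0 - 4.0 * x_star * f_bar / (A * A)
    if disc < 0.0:
        raise ValueError("need 4 * f_bar * x_star < (gap - eta)^2")
    d = A * (1.0 + math.sqrt(disc)) / 2.0
    return d, 1.0 / d

d, step_bound = step_increase(gap=0.05, eta=0.01, f_bar=1e-4, x_star=0.9)
print(d, step_bound)
```

A smaller false alarm allowance $\bar f$ pushes the increase toward $gap - \eta$ itself, so the number of steps is then governed almost entirely by the gap.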
We also consider the variable power case. A quantity needed
in our analysis is $C_{\ell,R} = \pi_{(\ell)} L\nu/(2R)$. With $\pi_{(\ell)}$ proportional
to $u_\ell = e^{-2C(\ell-1)/L}$, this becomes $u_\ell C_L/R$, where the
value $C_L = (L/2)(1 - e^{-2C/L})$ is near the capacity $C$. Then
$C_{\ell,R}$ is near $u_\ell$ when $R$ is near the capacity $C$. In the variable
power case, the mean separation of the $Z_{k,j}$ is given by $\mu_{k,j} =
\mu_{x_{k-1}}(C_{\ell,R})$ for section $\ell$. Likewise the role of the function
$g(x)$ is played by

$$g_L(x) = \sum_{\ell=1}^{L} \pi_{(\ell)}\, \Phi(\mu_x(C_{\ell,R})).$$

When $\pi_{(\ell)}$ is proportional to $u_\ell = e^{-2C(\ell-1)/L}$, $g_L(x)$ is
at least the value of a nearby integral

$$g(x) = \frac{1}{\nu} \int_{e^{-2C}}^{1} \Phi(\mu_x(uC/R))\, du.$$

This $g(x)$ is found to compare favorably to $x$, to yield the
required growth of the $x_k$.

Consider the case allowing leveling with which $\pi_{(\ell)} =
\max\{u_\ell, cut\}/sum$, for which the normalizing sum is found
to be $(L\nu/2C)[1 + \delta_{sum}]$, where $\delta_{sum}$ is near $D(\delta_c)/snr$,
bounded by $\delta_c^2/(2\,snr)$, with $\delta_c = c/\sqrt{2\log B}$ and $D(\delta) =
(1+\delta)\log(1+\delta) - \delta$. The function $g_L(x)$ is defined as above
with an analogous nearby integral with $\max\{u, cut\}$ in place
of $u$. Set $r > 0$ and consider the rate

$$R = \frac{C}{(1 + \delta_{sum})(1 + \delta_a)^2(1 + 2r/\tau^2)},$$

where $\tau^2 = 2(\log B)(1 + \delta_a)^2$ with $\delta_a = a'/\sqrt{2\log B}$. Setting
a suitably small false alarm rate to not interfere with the
accumulation of correct detections, the resulting $\delta_a$ is of order
$[\log\log B + \log snr]/(\log B)$, so all three sources of rate drop
above, $\delta_{sum}$, $\delta_a$ and $r/\tau^2$, are of order $1/\log B$ to within a
loglog factor. A relevant lemma is the following.

Lemma 2: Let $x_{up}$ be near 1 with $1 - x_{up} = (1/snr)(2r/\tau^2)$.
For any non-negative $a$, $c$, and $r$, with the rate given above,
the function $g(x) - x$ for $0 \le x \le x_{up}$ is minimized at $x_{up}$.

The proof is based on an evaluation of the integral
$g(x)$ which has expression in terms of the variable $z =
\mu_x(cut\, C/R)$ which is one-to-one with $x$. The value $x_{up}$
corresponds to a point $z_{up} = \zeta$ with favorable properties.
Expressing the function in terms of $z$, one makes separate
treatment of the behavior for $z < -\zeta$ where the function is
close to decreasing, and for $-\zeta \le z \le +\zeta$, where the function
is close to symmetric, slightly skewed to be lower at $+\zeta$.

The value of $\zeta$ is near $c/2$. Consider choices that
approximately optimize the overall rate, yielding $\zeta$ near
$\sqrt{\log((\log B)/4\pi)}$, at which the gap of $g(x) - x$ at $x_{up}$
is at least a value near $(1/snr)(2r - 1/2)/\tau^2$, positive for
$r > 1/4$. Moreover, choosing $a$ such that the false alarm rate
$\bar f = 2f^*$ equals $(gap - \eta)^2/4$, so that the conditions of Lemma
1 are satisfied, it produces a value of $\delta_a$ of the indicated form.

Let's state the result regarding reliability of the multi-step
adaptive successive decoder. The proof is based on the above-mentioned
normal approximation bound and a large deviation
bound for weighted combinations of Bernoulli random variables,
for which one may see the full manuscript [6].
Theorem 3: Suppose the communication rate and power
allocation are such that $g$ is accumulative, with $g(x) - x > 0$
on $[0, x^*]$. Pick $\eta_k = \eta$ and $\bar f > f^*$ such that the conditions
of Lemma 1 are satisfied, or more generally arrange
$q_{1,k} = g(q^{adj}_{1,k-1}) - \eta_k$ so that the increase $q_{1,k} - q_{1,k-1}$
remains positive for $k \le m$. If the penultimate step $m - 1$
is such that $q^{adj}_{1,m-1}$ is the first with value at least $x^*$, then
with $rem = 1 - q_{1,m}$, the $m$ step single-dictionary decoder
incurs a fraction of errors less than $m\bar f + rem$, except in an
event of probability not more than the sum for $k$ from 1 to $m$
of

$$e^{-L_\pi D(q_{1,k}\|q^*_{1,k}) + c_0 k} + e^{-L_\pi (B-1) D(p\|p^*)} + e^{-(n-k+1) D_{\epsilon_k}}.$$

Here $D(\cdot\|\cdot)$ refers to the Kullback-Leibler divergence between
two Bernoulli random variables; $p, p^*$ equal the corresponding
$\bar f, f^*$ divided by $B - 1$; and $D_\epsilon = -\log(1 - \epsilon) - \epsilon$ which
is at least $\epsilon^2/2$. Also $\epsilon_k = (n\epsilon - k + 1)/(n - k + 1)$, where
$\epsilon = 1 - (1 - h/\sqrt{2\log B})^2$, and $c_0 = C$. Moreover, $L_\pi =
1/\max_\ell \pi_{(\ell)}$, approximately a constant multiple of $L$ for the
designs investigated here.

To produce each step $q_{1,k}$ from $q^*_{1,k}$, one may set a constant
difference $\eta_k = \eta$ and invoke the bound $D(q_{1,k}\|q^*_{1,k}) \ge 2\eta^2$.
A preferred tactic, used in producing the curves shown earlier,
is at each step to choose $q_{1,k}$ to produce constancy of the
exponent $D(q_{1,k}\|q^*_{1,k})$ at a prescribed value, equalizing the
contributions to the above probability bound from each step.